Predictive Analytics and Machine Learning Primer

This notebook was put together by **Andrew Greenhut** on **June 3rd, 2015**.


In [1]:
# Run this code at the beginning of the presentation
from IPython.display import Image, display
from fig_code import plot_sgd_separator
from fig_code import plot_linear_regression
import seaborn; seaborn.set()
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
display(Image(filename='images/DS_Cat.jpg'))


Goals of this Presentation

  • Introduce the basics of Machine Learning and some skills useful in practice.
  • Demonstrate Predictive Analytics (via Optical Character Recognition) to give you ideas of how it can be applied.


In [3]:
display(Image(filename='images/dashboard-snockered-624x418.png'))


What is Machine Learning?

Machine Learning refers to computer programs that adapt their behavior to previously seen data.

The output depends on the algorithm and on a set of tunable parameters (known as "hyper-parameters").

Most algorithms fall into one of two categories: supervised or unsupervised.

  • Supervised learning uses labels to train a model on known data, such that it can predict labels for new data.
    • e.g. Regression, Classification
  • Unsupervised learning finds structure in unlabeled data, with no labels to guide it (we only touch on it briefly later, for dimensionality reduction).
    • e.g. Clustering, Anomaly Detection

There is no magic to Machine Learning: under the hood it is mostly linear algebra on matrices, with the goal of minimizing an error function, as the sketch below illustrates.
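
As a minimal sketch of what "minimizing an error function" means in practice, here is ordinary least squares on synthetic data using only NumPy; the slope and intercept below are made-up values for illustration, not from the presentation:

# A minimal sketch: fit a line by minimizing the squared error ||Xw - y||^2.
# The true slope (2) and intercept (1) are illustrative assumptions.
import numpy as np

rng = np.random.RandomState(0)
x = rng.rand(50)
y = 2 * x + 1 + 0.1 * rng.randn(50)        # noisy samples of a line

X = np.column_stack([x, np.ones_like(x)])  # design matrix: [x, 1]
w, *_ = np.linalg.lstsq(X, y, rcond=None)  # least-squares solution
print(w)                                   # roughly [2.0, 1.0]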

Regression

A model is learned from the training data and can then be used to predict results for new data. This might seem like a trivial problem, but it is a basic example of a type of operation that is fundamental to many machine learning tasks.


In [4]:
plot_linear_regression()
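
The helper above hides the details. A hedged sketch of the same idea with scikit-learn's LinearRegression, on made-up data rather than whatever plot_linear_regression uses internally:

from sklearn.linear_model import LinearRegression
import numpy as np

rng = np.random.RandomState(1)
X = 10 * rng.rand(100, 1)               # feature matrix: [n_samples, n_features]
y = 3 * X.ravel() + 2 + rng.randn(100)  # noisy line: slope 3, intercept 2

model = LinearRegression()
model.fit(X, y)                         # learn slope and intercept
print(model.predict([[5.0]]))           # predict y for a new point, x = 5.0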


Classification

If you were to drop another point onto the plane, this algorithm could now predict whether it's a blue or a red point.


In [5]:
plot_sgd_separator()
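
plot_sgd_separator is another bundled fig_code helper. A sketch of how such a linear separator could be trained directly, assuming two synthetic blobs of points (the blob parameters below are illustrative):

from sklearn.datasets import make_blobs
from sklearn.linear_model import SGDClassifier

X, y = make_blobs(n_samples=50, centers=2,
                  random_state=0, cluster_std=0.60)
clf = SGDClassifier(loss='hinge')    # a linear, SVM-style decision boundary
clf.fit(X, y)
print(clf.predict([[3.0, 3.0]]))     # label for a newly dropped point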


Machine Learning Data Format

Most machine learning algorithms expect data to be stored in a table (or matrix) of size [n_samples (rows), n_features (columns)].

  • n_samples: The number of samples. A sample can be a document, a picture, a sound, a video, a user, a row in a database, etc.
  • n_features: The number of features, or distinct traits, used to describe each sample. Features are generally continuous values, but may also be dates, booleans, or discrete categories.

The feature space can be very high-dimensional (e.g. millions of features), with most feature values being zero for any given sample, as in the sketch below.
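
A minimal sketch of that layout (the numbers are made up): rows are samples, columns are features, and a sparse matrix can hold the mostly-zero case efficiently:

import numpy as np
from scipy.sparse import csr_matrix

X = np.array([[5.1, 3.5, 1.4],   # sample 0
              [4.9, 3.0, 1.4],   # sample 1
              [6.2, 3.4, 5.4]])  # sample 2
n_samples, n_features = X.shape
print(n_samples, n_features)     # 3 samples, 3 features

X_sparse = csr_matrix(X)         # compact storage when most entries are zero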

How is it Implemented?


In [6]:
display(Image(filename='images/images.png'))



In [7]:
display(Image(filename='images/supervised_learning_flowchart.png'))
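
In scikit-learn, the flowchart above maps onto a consistent estimator API: choose a model class, set its hyper-parameters, fit on labeled data, then predict. A hedged sketch of that pattern on synthetic data (the dataset parameters are assumptions):

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=4, random_state=0)

model = LogisticRegression()   # 1. choose a model class and hyper-parameters
model.fit(X, y)                # 2. fit the model to labeled training data
print(model.predict(X[:5]))    # 3. predict labels (here, for already-seen data)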



Demo: Optical Character Recognition


In [8]:
display(Image(filename='images/Predictive-Analytics.png'))


Loading and visualizing the digits data

We'll use scikit-learn's data access interface and take a look at this data:


In [9]:
from sklearn import datasets
digits = datasets.load_digits()
digits.images.shape


Out[9]:
(1797, 8, 8)

Let's plot a few of these:


In [10]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(digits.images[i], cmap='binary')
    ax.text(0.05, 0.05, str(digits.target[i]),
            transform=ax.transAxes, color='green')
    ax.set_xticks([])
    ax.set_yticks([])


Here the data is simply each pixel value within an 8x8 grid:


In [11]:
# The images themselves
print(digits.images.shape)
print(digits.images[0])


(1797, 8, 8)
[[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]

In [12]:
# The data for use in our algorithms
print(digits.data.shape)
print(digits.data[0])


(1797, 64)
[  0.   0.   5.  13.   9.   1.   0.   0.   0.   0.  13.  15.  10.  15.   5.
   0.   0.   3.  15.   2.   0.  11.   8.   0.   0.   4.  12.   0.   0.   8.
   8.   0.   0.   5.   8.   0.   0.   9.   8.   0.   0.   4.  11.   0.   1.
  12.   7.   0.   0.   2.  14.   5.  10.  12.   0.   0.   0.   0.   6.  13.
  10.   0.   0.   0.]

In [13]:
# The target label
print(digits.target)


[0 1 2 ..., 8 9 8]

So our data have 1797 samples in 64 dimensions.

Unsupervised Learning: Dimensionality Reduction

We'd like to visualize our points within the 64-dimensional parameter space, but it's difficult to plot points in 64 dimensions! Instead we'll reduce the dimensions to 2, using an unsupervised method. Here, we'll make use of a manifold learning algorithm called Isomap, and transform the data to two dimensions.


In [14]:
from sklearn.manifold import Isomap

In [15]:
iso = Isomap(n_components=2)
data_projected = iso.fit_transform(digits.data)

In [16]:
data_projected.shape


Out[16]:
(1797, 2)

In [17]:
plt.scatter(data_projected[:, 0], data_projected[:, 1], c=digits.target,
            edgecolor='none', alpha=0.5, cmap=plt.cm.get_cmap('nipy_spectral', 10));
plt.colorbar(label='digit label', ticks=range(10))
plt.clim(-0.5, 9.5)


We see here that the digits are fairly well-separated in the parameter space; this tells us that a supervised classification algorithm should perform fairly well. Let's give it a try.

Classification on Digits

Let's try a classification task on the digits. The first thing we'll want to do is split the digits into a training and testing sample:


In [18]:
from sklearn.model_selection import train_test_split  # sklearn.cross_validation in older releases
Xtrain, Xtest, ytrain, ytest = train_test_split(digits.data, digits.target,
                                                random_state=2)
print(Xtrain.shape, Xtest.shape)


(1347, 64) (450, 64)

Let's use a simple logistic regression, which (despite its confusing name) is a classification algorithm:


In [19]:
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(penalty='l2')
clf.fit(Xtrain, ytrain)
ypred = clf.predict(Xtest)

We can check our classification accuracy by comparing the true values of the test set to the predictions:


In [20]:
from sklearn.metrics import accuracy_score
accuracy_score(ytest, ypred)


Out[20]:
0.94666666666666666
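
For intuition, accuracy is just the fraction of predictions that match the true labels, so the same number can be computed by hand:

import numpy as np
print(np.mean(ytest == ypred))   # fraction of correct predictions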

This single number doesn't tell us where we've gone wrong; one nice way to find out is to use the confusion matrix:


In [21]:
from sklearn.metrics import confusion_matrix
print(confusion_matrix(ytest, ypred))


[[42  0  0  0  0  0  0  0  0  0]
 [ 0 45  0  1  0  0  0  0  3  1]
 [ 0  0 47  0  0  0  0  0  0  0]
 [ 0  0  0 42  0  2  0  3  1  0]
 [ 0  2  0  0 36  0  0  0  1  1]
 [ 0  0  0  0  0 52  0  0  0  0]
 [ 0  0  0  0  0  0 42  0  1  0]
 [ 0  0  0  0  0  0  0 48  1  0]
 [ 0  2  0  0  0  0  0  0 38  0]
 [ 0  0  0  1  0  1  0  1  2 34]]

In [22]:
plt.imshow(np.log1p(confusion_matrix(ytest, ypred)),  # log1p avoids log(0) = -inf
           cmap='Blues', interpolation='nearest')
plt.grid(False)
plt.ylabel('true')
plt.xlabel('predicted');


We might also take a look at some of the outputs along with their predicted labels. We'll make the bad labels red:


In [23]:
fig, axes = plt.subplots(10, 10, figsize=(8, 8))
fig.subplots_adjust(hspace=0.1, wspace=0.1)

for i, ax in enumerate(axes.flat):
    ax.imshow(Xtest[i].reshape(8, 8), cmap='binary')
    ax.text(0.05, 0.05, str(ypred[i]),
            transform=ax.transAxes,
            color='green' if (ytest[i] == ypred[i]) else 'red')
    ax.set_xticks([])
    ax.set_yticks([])


The interesting thing is that even with this simple logistic regression algorithm, many of the mislabeled cases are ones that we ourselves might get wrong!

There are many ways to improve this classifier, but we're out of time here. To go further, we could use a more sophisticated model, use cross validation, or apply other techniques. We'll cover some of these topics later in the tutorial.
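
As a sketch of one such next step, 5-fold cross validation estimates accuracy more robustly than a single train/test split:

from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

# Average accuracy over 5 different train/test folds of the digits data
scores = cross_val_score(LogisticRegression(), digits.data, digits.target, cv=5)
print(scores.mean())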

Special thanks to [Jake Vanderplas](http://www.vanderplas.com) for his Scikit Learn content. Check out his Pycon 2015 tutorial on [GitHub](https://github.com/jakevdp/sklearn_pycon2015/), or his tutorial [video](http://pyvideo.org/video/3429/machine-learning-with-scikit-learn-i).